Full-Stack LLM Application Development - Day 17 - Storing Vector Data with Pinecone

15th Ironman Contest

大魔術熊貓工程師

2023-10-02 09:11:59


Yesterday we finished the basic Pinecone setup. Today we start writing code against it with the SDK.

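First, install the Pinecone SDK with the command poetry add pinecone-client.
A Pinecone index is a vector space. You can think of it roughly as a database table: it is where we store many vectors. Yesterday we created an index named ironman2023 on the Pinecone website; you can also create one with the code below.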

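import time

import pinecone

# Assumes pinecone.init(...) has already been called (see init_pinecone in this post).
def create_index(index_name):
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(
            index_name,
            dimension=1536,  # matches the embedding model's output size
            metric='cosine'
        )
        # Wait until the index has finished being created
        while not pinecone.describe_index(index_name).status['ready']:
            time.sleep(1)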
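
Since the index is created with metric='cosine', it may help to see what that metric computes. The following is a minimal pure-Python sketch of cosine similarity, not Pinecone's internal implementation; the function name cosine_similarity is made up for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Only the direction of the vectors matters here, not their length, which is why scaling a vector by 2 leaves the similarity unchanged.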

Next, create a file named pinecone_tutorial.py and use the following code to connect to the Pinecone index.

def init_pinecone(index_name):
    pinecone.init(
        api_key='yourkey',
        environment='gcp-starter'
    )
    index = pinecone.Index(index_name)
    return index
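
Hard-coding the API key is fine for a quick test, but you may prefer to read it from the environment. Below is a small sketch of one way to do that. The helper name load_pinecone_settings and the variable names PINECONE_API_KEY and PINECONE_ENVIRONMENT are assumptions made for this example, not something defined earlier in this series.

```python
import os


def load_pinecone_settings(env=os.environ):
    """Collect Pinecone connection settings from a mapping of environment variables.

    The variable names are assumptions chosen for this sketch.
    """
    api_key = env.get("PINECONE_API_KEY")
    if not api_key:
        raise RuntimeError("PINECONE_API_KEY is not set")
    # Fall back to the environment name used in this post.
    environment = env.get("PINECONE_ENVIRONMENT", "gcp-starter")
    return {"api_key": api_key, "environment": environment}


# A plain dict stands in for os.environ so the example runs anywhere.
settings = load_pinecone_settings({"PINECONE_API_KEY": "yourkey"})
print(settings["environment"])
```

Inside init_pinecone, the returned dict could then be passed along as pinecone.init(**load_pinecone_settings()).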

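Next, we write a function to insert data, using the upsert API. Note that vector databases generally let you attach a JSON metadata object to each vector, which is a convenient place to store the original text. Recall that in the examples from the past few days we had to keep the original texts in a separate list. With a vector database like Pinecone that is no longer necessary, because the original text can be stored together with its vector from the start.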

def add_to_pinecone(index, embeddings, text_array):
    # Use each item's position as its id. Upserting an existing id overwrites it.
    ids = [str(i) for i in range(len(embeddings))]

    # Keep the original text in the metadata of each vector
    text_array_to_metadata = [{"content": text} for text in text_array]

    # Each inserted item looks like this:
    # ("2", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], {"content": "lyrics"})
    ids_embeddings_metadata_tuple = list(zip(
        ids, embeddings, text_array_to_metadata))

    index.upsert(ids_embeddings_metadata_tuple)
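
With only two texts a single upsert call is enough, but for larger datasets it is common to send the vectors in batches. The sketch below builds the same (id, embedding, metadata) tuples and splits them into chunks. The helper names build_vectors and chunked, the start_id parameter, and the batch size of 100 are all assumptions made for illustration.

```python
def build_vectors(embeddings, text_array, start_id=0):
    """Pair each embedding with a string id and its original text as metadata."""
    if len(embeddings) != len(text_array):
        raise ValueError("embeddings and text_array must have the same length")
    return [
        (str(start_id + i), list(embedding), {"content": text})
        for i, (embedding, text) in enumerate(zip(embeddings, text_array))
    ]


def chunked(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


vectors = build_vectors([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], ["a", "b", "c"])
batches = list(chunked(vectors, batch_size=2))
print([len(batch) for batch in batches])
```

Each batch could then be sent with index.upsert(vectors=batch). The start_id offset is there so that a second call does not reuse, and therefore overwrite, ids "0", "1", and so on.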

Then we write a function for searching.

def search_from_pinecone(index, query_embedding, k=1):
    results = index.query(vector=query_embedding,
                          top_k=k, include_metadata=True)
    return results

Finally, here is our main function: it converts the question into a vector and runs the search.

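import openai  # module-level configuration style used below

# get_embedding(text, model_name) is not defined in this post; it is assumed
# to be the helper from the earlier days of this series.
def main():
    EMBEDDING_MODEL_NAME = "embedding-ada-002"  # the model deployment name you created on Azure OpenAI a few days ago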
    openai.api_base = "https://japanopenai2023ironman.openai.azure.com/"
    openai.api_key = "yourkey"
    openai.api_type = "azure"
    openai.api_version = "2023-03-15-preview"

    text_array = ["我會披星戴月的想你，我會奮不顧身的前進，遠方煙火越來越唏噓，凝視前方身後的距離",
                  "而我，在這座城市遺失了你，順便遺失了自己，以為荒唐到底會有捷徑。而我，在這座城市失去了你，輸給慾望高漲的自己，不是你，過分的感情"]

    embedding_array = [get_embedding(
        text, EMBEDDING_MODEL_NAME) for text in text_array]
    
    index = init_pinecone("ironman2023")

    # add_to_pinecone(index, embedding_array, text_array)

    query_text = "工程師寫城市"
    query_embedding = get_embedding(query_text, EMBEDDING_MODEL_NAME)
    result = search_from_pinecone(index, query_embedding, k=1)

    print(f"尋找 {query_text}:", result)

if __name__ == '__main__':
    main()

The result looks like this:

尋找 工程師寫城市: {'matches': [{'id': '1',
              'metadata': {'content': '而我，在這座城市遺失了你，順便遺失了自己，以為荒唐到底會有捷徑。而我，在這座城市失去了你，輸給慾望高漲的自己，不是你，過分的感情'},
              'score': 0.791866839,
              'values': []}],
 'namespace': ''}
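
To use the result in an application you usually only need the id, the score, and the stored text. The sketch below pulls those out of a result shaped like the output above. It assumes the result is available as a plain dict; how the SDK's response object is converted to one depends on the client version and is not shown here. The helper name extract_matches is hypothetical.

```python
def extract_matches(result):
    """Return (id, score, content) tuples from a query result given as a plain dict."""
    return [
        (match["id"], match["score"], match.get("metadata", {}).get("content"))
        for match in result.get("matches", [])
    ]


# Sample data copied from the output shown in this post.
sample_result = {
    "matches": [
        {
            "id": "1",
            "metadata": {"content": "而我，在這座城市遺失了你，順便遺失了自己，以為荒唐到底會有捷徑。而我，在這座城市失去了你，輸給慾望高漲的自己，不是你，過分的感情"},
            "score": 0.791866839,
            "values": [],
        }
    ],
    "namespace": "",
}

for match_id, score, content in extract_matches(sample_result):
    print(match_id, score, content)
```

Note that values is empty in the output because the query only asked for metadata (include_metadata=True), not for the vectors themselves.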

The namespace field in this result can be set when you upsert data, for example index.upsert(vectors=ids_embeddings_metadata_tuple, namespace='告五人').

Understanding Pinecone pods

Next, let's look at Pinecone's pods.

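Starter plan
On the Starter plan you can create one pod, which has enough resources to support roughly 100,000 vectors with 1536-dimensional embeddings and metadata. Conveniently, 1536 is exactly the dimensionality of OpenAI's text-embedding-ada-002.

On the Starter plan, every create_index call ignores the pod_type parameter, because only one index can be created 🥲🥲.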

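s1 pods

s1 pods are storage-optimized: they offer large storage capacity and a lower overall cost, at the price of slightly higher query latency than p1 pods. They are a good fit for very large indexes with moderate or relaxed latency requirements.

Each s1 pod holds roughly 5 million 768-dimensional vectors.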

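p1 pods
p1 pods offer very low query latency but hold fewer vectors per pod than s1 pods. They are a good fit for applications with low latency requirements (under 100 ms).

Each p1 pod holds roughly 1 million 768-dimensional vectors.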

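p2 pods
p2 pods offer higher query throughput and lower latency. For vectors with fewer than 128 dimensions and queries with topK below 50, p2 pods support up to 200 QPS per replica and return queries in under 10 ms. In that range, both query throughput and latency are better than with s1 and p1 pods.

Each p2 pod holds roughly 1 million 768-dimensional vectors, although capacity may vary with the number of dimensions.

The data ingestion rate of p2 pods is noticeably slower than that of p1 pods, and it drops further as the number of dimensions grows. For example, a p2 pod with 128-dimensional vectors can handle up to 300 updates per second, while a p2 pod with vectors of 768 dimensions or more supports up to 50 updates per second. Because the query latency and throughput of p2 pods differ from those of p1 pods, you should test p2 performance with your own dataset.

Every pod type comes in four pod sizes: x1, x2, x4, and x8. Index storage and compute capacity double with each size step. The default pod size is x1, and you can increase the pod size after the index has been created.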
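
The capacity figures above can be turned into a rough back-of-the-envelope estimate. The sketch below uses only the numbers quoted in this post for 768-dimensional vectors and assumes that capacity scales linearly with pod size. It ignores metadata size and other dimensionalities, so treat the output as a starting point rather than a sizing guarantee. The function name estimate_pods is made up for this example.

```python
import math

# Approximate vectors per x1 pod at 768 dimensions, as quoted in this post.
VECTORS_PER_X1_POD_768D = {"s1": 5_000_000, "p1": 1_000_000, "p2": 1_000_000}

# Assumption: each size step doubles capacity relative to the previous one.
SIZE_MULTIPLIER = {"x1": 1, "x2": 2, "x4": 4, "x8": 8}


def estimate_pods(num_vectors, pod_type="p1", pod_size="x1"):
    """Estimate how many pods of the given type and size hold num_vectors vectors."""
    capacity = VECTORS_PER_X1_POD_768D[pod_type] * SIZE_MULTIPLIER[pod_size]
    return max(1, math.ceil(num_vectors / capacity))


print(estimate_pods(12_000_000, "s1"))        # storage-optimized pods, size x1
print(estimate_pods(12_000_000, "p1", "x4"))  # low-latency pods, size x4
```

For 12 million vectors this gives 3 pods in both cases: three s1.x1 pods at 5 million each, or three p1.x4 pods at 4 million each.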